Abstract

Feature selection has attracted significant attention in data mining and machine learning in the past decades. 
Many existing feature selection methods eliminate redundancy by measuring pairwise inter-correlation of 
features, whereas the complementariness of features and higher inter-correlation among more than two 
features are ignored. In this study, a modification item concerning the complementariness of features is 
introduced in the evaluation criterion of features. Additionally, in order to identify the interference effect 
of already-selected False Positives (FPs), the redundancy-complementariness dispersion is also taken into 
account to adjust the measurement of pairwise inter-correlation of features. To illustrate the effectiveness of
the proposed method, classification experiments are conducted with four frequently used classifiers on ten datasets.
The classification results verify the superiority of the proposed method over five representative feature
selection methods.

Keywords: Classification, Feature selection, Relevance, Redundancy, Complementariness

1. Introduction

The dimensionality and size of data are growing quickly in most fields, which challenges data mining and
machine learning techniques. Feature selection is an important and useful method that can effectively
reduce the dimensionality of the feature space while retaining a relatively high accuracy in representing
the original data. Thus, it plays a fundamental role in many data mining and machine learning tasks,
particularly in pattern recognition [1], knowledge discovery [2, 3], information retrieval [4, 5], computer
vision [6, 7], bioinformatics [8], and so forth. The benefits of feature selection [9] have been widely
recognized: it facilitates data interpretation, reduces acquisition and storage requirements, increases
learning speed, and improves generalization performance. Therefore, feature selection has attracted the
attention of more and more researchers [10, 11, 12, 13, 14].

Generally speaking, feature selection methods can be divided into two types: wrapper and filter.
Wrapper methods depend on a specific learning algorithm, so their performance is affected by the chosen
learner. This also makes wrapper methods computationally expensive, since they must train and test the
classifier for each candidate feature subset. Conversely, filter methods do not rely on any learning
scheme. Instead, they are based only on classifier-independent metrics, including Fisher score [15],
χ²-test [16], mutual information [17, 18, 19, 20, 21], Symmetrical Uncertainty (SU) [22, 11], etc., to
estimate the discrimination power of features. In this study, we focus only on filter methods.

Filter methods can be further divided into feature subset selection and feature ranking methods,
according to their search strategy. The evaluation unit for subset selection methods is a set of features,
and the goal is to discover the subset with the best discrimination power [22, 23, 24, 25, 26]. Nevertheless,
finding the best feature subset requires traversing $O(2^m)$ candidate subsets (where $m$ is the number of
features in the original data), because the feature selection task cannot be solved optimally in polynomial
time unless P = NP [27]. Thus it is computationally intractable in practice. Unlike subset methods, feature
ranking methods take individual features as the evaluation units and rank them according to their
discrimination power [28, 29, 30]. These methods usually employ heuristic search strategies such as forward
search, backward search, and sequential floating search.

However, for both feature ranking and feature subset selection methods, two problems may lead to a
wrong ranking or to lower classification capacity. One is that ignoring feature interaction and
dependencies may lead to redundancy, as some feature selection methods such as MIM [31] assume that
features are independent. For real-world datasets, especially high-dimensional ones, such a strong
assumption may produce results far from optimal. The other problem is that the group capacity of features
is usually ignored, since many methods only measure the relationship between pairs of features
[17, 32, 30]. For example, a feature that has low individual classification capacity and is highly
dependent on other features may be overlooked, or even misidentified as a redundant feature, when only its
pairwise relationship with other features is measured. However, since it is highly dependent on other
features, it is also possible that the feature contributes largely to the discrimination power of the
subset consisting of such features. In that case, it should be evaluated as a salient feature and selected.
Since the dependence among features is related to both redundancy and complementariness, a more precise
correlation analysis is needed to distinguish them effectively. To this end, we propose a novel feature
selection algorithm that modifies the redundancy analysis applied in prior methods by introducing a
modification item and a dynamic coefficient to adjust the redundancy-complementariness identification
process.

The remainder of the paper is organized as follows: Section 2 reviews related work. Section 3 presents
the information theoretic metrics and evaluation criteria. The new feature selection method is presented
in Section 4. In Section 5, the experimental study is described and the results are discussed. Finally,
Section 6 concludes this study and proposes possible further work.

2. Related work 

In recent decades, many kinds of feature selection methods have been studied. In general, these methods
have two aims: one is to find the most class-relevant features, the other is to remove redundancy. Most
feature selection algorithms can effectively find relevant features [33, 29, 34]. A well-known example is
Relief, developed by Kira and Rendell [28]. The main idea of Relief is to rank features by a weight that
reflects their ability both to discriminate instances with different class labels and to cluster those with
the same class label, based on the distance between instances. However, Relief may be ineffective because
it cannot remove features that receive similar weights; in other words, redundant features cannot be
identified. A typical and widely used extension of Relief is ReliefF [35], which can handle noisy and
incomplete datasets. However, it is still unable to remove redundant features. Redundant features are
considered to have negative effects on the accuracy and speed of classification methods, hence many feature
selection methods have been proposed to address this problem with statistics-based metrics
[36, 37, 30, 23, 22]. For example, the Correlation-based Feature Selection (CFS) algorithm proposed by
Hall et al. [23] adopts the cor value to simultaneously measure a feature subset's correlation to the class
and the inter-correlation among the features in it. CFS selects the subset that obtains the maximum cor
value. However, CFS does not designate a specific search approach, so how to search for feature subsets
remains an open problem.

The Minimum Redundancy and Maximum Relevance (mRMR) criterion and its variants [17, 32, 38, 30]
apply information theoretic metrics to separately measure class-relevance and pairwise correlation between
features. A comprehensive score consisting of the two indices is applied to evaluate and select features.
The Fast Correlation Based Feature selection algorithm (FCBF) proposed by Yu and Liu [22] is another
typical method that separately handles relevance and redundancy. FCBF utilizes Symmetrical Uncertainty (SU)
as the metric to represent class-relevance and pairwise correlation. If the class-relevance of a feature is
lower than both that of another feature and the correlation between the two, it is identified as a
redundant feature and removed. Recently, an extension of FCBF was proposed in order to identify redundant
features more precisely [39]. All of the above-mentioned methods take pairwise correlation as the
redundancy index and identify features with a high index value as redundant, while ignoring 1)
complementary correlation between features (which we discuss in detail in Section 3.2) and 2) correlation
among more than two features. Both remain problems that impair the performance of feature selection.

Much effort has been made to tackle the former problem mentioned above [24, 40, 41, 42, 43, 20, 21,
11, 44, 45]. Fleuret [24] and Wang et al. [40] propose the Conditional Mutual Information Maximization
(CMIM) criterion for feature selection. CMIM harnesses Conditional Mutual Information (CMI) to measure
the intensity of relevance and redundancy. Since CMI can implicitly identify complementary correlation
between features, i.e. a large value of $I(F; C \mid F_s)$ implies that 1) $F$ is relevant to class $C$, and
2) $F$ is highly complementary with $F_s$, many information theoretic feature selection methods apply it to
build their evaluation criteria [46, 42, 41, 47]. There are also several methods that explicitly identify
redundancy and complementary correlation without CMI. Algorithms based on the Joint Mutual Information
(JMI) criterion [43, 44] take into account the mutual information between pairs of features and the class.
Since both a feature relevant to the class and one complementary to salient features obtain high JMI
values, both will be identified as salient features and are thus more likely to be selected. Although the
above-mentioned methods try to recognize complementariness from the pairwise correlation of features,
measuring pairwise correlation is only an approximation to measuring the correlation among more than two
features. Under this approximation, features that are strongly complementary to a certain selected feature
(or features) but not significantly correlated with the feature group may be selected, which in turn
interferes with the later selection process.

3. Information theoretic metrics and evaluation criteria 

3.1. Entropy, mutual information, and conditional mutual information 

In this section, some essential information theoretic metrics used in our method are described.
Entropy, a fundamental measure of information, quantifies the uncertainty present in the distribution of a
random variable $X$, and is defined as

$$H(X) = -\sum_{x \in X} p(x) \log p(x),$$

where $x \in X$ denotes the possible value assignments of $X$ and $p(x)$ is the distribution of $x$ (for
convenience, we hereafter use $\log$ to denote the base-2 logarithm instead of $\log_2$). According to
probability theory, one can use conditional entropy to quantify the uncertainty of one variable conditioned
on another. The conditional entropy of $X$ given $Y$ is defined as

$$H(X \mid Y) = -\sum_{y \in Y} \sum_{x \in X} p(x, y) \log p(x \mid y).$$

Mutual Information (MI) between two random variables $X$ and $Y$ is defined as

$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)},$$

where $x \in X$ and $y \in Y$ are the possible value assignments of $X$ and $Y$, respectively. MI can be
considered as the amount of information shared by two variables. In the feature selection field, it is one
of the most widely used metrics for measuring the correlation intensity of two features. Note that MI is a
symmetric metric, i.e. $I(X; Y) = I(Y; X)$, and that $I(X; Y) = 0$ implies that $X$ and $Y$ are
statistically independent. Conditional mutual information (CMI), which is an extension of MI for measuring
the conditional dependence between two random variables given a third, is defined as

$$I(X; Y \mid Z) = \sum_{z \in Z} p(z) \sum_{x \in X} \sum_{y \in Y} p(x, y \mid z) \log \frac{p(x, y \mid z)}{p(x \mid z)\,p(y \mid z)}.$$

$I(X; Y \mid Z)$ can be interpreted as the information shared between $X$ and $Y$ given the value of a
third variable $Z$. MI and CMI can also be expressed with entropies as follows:

$$I(X; Y) = H(X) - H(X \mid Y)$$

and

$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z).$$
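
As a hedged illustration of the definitions above, the following minimal sketch estimates entropy, MI, and CMI from samples of discrete variables using empirical frequencies. It relies on the chain rule $H(X \mid Y) = H(X, Y) - H(Y)$, which turns the two entropy-based expressions into sums of joint entropies. The function names and the toy XOR data are illustrative assumptions and are not taken from the authors' implementation.

```python
from collections import Counter
from math import log2


def entropy(*cols):
    """Empirical joint entropy H(X1, ..., Xk) in bits of equal-length discrete sequences."""
    n = len(cols[0])
    counts = Counter(zip(*cols))
    return -sum((c / n) * log2(c / n) for c in counts.values())


def mutual_information(x, y):
    """I(X;Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def conditional_mutual_information(x, y, z):
    """I(X;Y|Z) = H(X|Z) - H(X|Y,Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


# Toy data (assumed for illustration): y = x XOR z, with x and z uniform and independent.
x = [0, 0, 1, 1]
z = [0, 1, 0, 1]
y = [a ^ b for a, b in zip(x, z)]

print(mutual_information(x, x))                 # 1.0: I(X;X) = H(X)
print(mutual_information(x, y))                 # 0.0: X and Y are independent
print(conditional_mutual_information(x, y, z))  # 1.0: given Z, X determines Y
```

The last two lines show why pairwise MI alone can be misleading: X carries no information about Y on its own, yet one full bit once Z is known.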

3.2. Relevance, redundancy, and complementariness analysis 

The motivation for using MI in feature selection is that a larger MI between a feature and the class
should imply a potentially greater discrimination ability when using the feature. In addition, a commonly
cited justification for using MI in feature selection is that MI can be used to write both an upper and a
lower bound on the Bayes error rate [48]. It can simply be applied as the criterion of a filter taking the
form of

$$J(F) = I(F; C), \qquad (1)$$

where $J(\cdot)$ denotes the evaluation criterion, $F$ denotes a candidate feature and $C$ denotes the
class. Intuitively, the top $m$ candidates that maximize $J(\cdot)$ are selected, where $m$ is a predefined
number or is decided by some stopping criterion. This criterion assumes that each feature is independent of
all other features, which makes the criterion very efficient. However, such an assumption rarely holds in
practice, where almost all features may be mutually dependent, which makes the criterion in Eq. (1) far
from optimal. In general, it is widely recognized that a salient set of features should not only be
individually relevant to the class, but also should not be redundant with respect to other features in the
set. In order to identify redundancy, mRMR and its variants have been proposed, which can be generally
written as

$$J(F) = D(F) - R(F), \qquad (2)$$

where $D(F)$ represents the relevance between $F$ and class $C$, and $R(F)$ describes the redundancy between
$F$ and the selected features in the subset $S$. Usually, as in mRMR [17], $D(F)$ and $R(F)$ take the forms
of MI. This criterion can efficiently find features with high class-relevance and low dependence on each
other in $S$. However, the term “redundancy” not only implies that features are highly dependent on each
other, but also indicates that one of them is substitutable, i.e. its discrimination power is significantly
impaired when some other feature(s) are given. From this viewpoint, only considering dependence between
features is not enough to effectively identify redundancy. In other words, a feature that is dependent on
another is not necessarily redundant. Instead, the two features may be complementary to each other, i.e.
they have stronger discriminatory power as a group (but may be weak as individuals), particularly in
microarray data analysis [49, 50]. To this end, a complementary modification item is introduced as

$$J(F) = D(F) - (R(F) - M(F)), \qquad (3)$$

where $M(F)$ is an item to identify complementary correlation between $F$ and the selected features in $S$.
In the context of MI, if $R(F)$ takes the form of $I(F; F_s)$ (as in mRMR), $M(F)$ could thus be denoted
as, e.g., $I(F; F_s \mid C)$, which represents the information shared between $F$ and $F_s$ given class
$C$. In order to illustrate this, we first show the relationship between $R(F)$ and $M(F)$ as follows


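$$
\begin{aligned}
R(F) - M(F) &= I(F; F_s) - I(F; F_s \mid C) \\
&= \sum_{f_s \in F_s} \sum_{f \in F} p(f, f_s) \log \frac{p(f, f_s)}{p(f)\,p(f_s)}
 - \sum_{c \in C} p(c) \sum_{f_s \in F_s} \sum_{f \in F} p(f, f_s \mid c) \log \frac{p(f, f_s \mid c)}{p(f \mid c)\,p(f_s \mid c)} \\
&= \sum_{c \in C} \sum_{f \in F} \sum_{f_s \in F_s} p(f, f_s, c) \log \left( \frac{p(f, f_s)}{p(f)\,p(f_s)} \cdot \frac{p(f \mid c)\,p(f_s \mid c)}{p(f, f_s \mid c)} \right) \\
&= \sum_{c \in C} \sum_{f \in F} \sum_{f_s \in F_s} p(f, f_s, c) \log \frac{p(f, f_s)\,p(f, c)\,p(f_s, c)}{p(f)\,p(f_s)\,p(c)\,p(f, f_s, c)} \\
&= \sum_{c \in C} \sum_{f \in F} \sum_{f_s \in F_s} p(f, f_s, c) \log \left( \frac{p(f, c)}{p(f)\,p(c)} \cdot \frac{p(f, f_s)\,p(f_s, c)}{p(f, f_s, c)\,p(f_s)} \right) \\
&= \sum_{f \in F} \sum_{c \in C} p(f, c) \log \frac{p(f, c)}{p(f)\,p(c)}
 - \sum_{f_s \in F_s} \sum_{f \in F} \sum_{c \in C} p(f, f_s, c) \log \frac{p(f, c \mid f_s)}{p(f \mid f_s)\,p(c \mid f_s)} \\
&= I(F; C) - I(F; C \mid F_s).
\end{aligned}
\qquad (4)
$$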
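
The identity in Eq. (4) can be checked numerically. The sketch below is an illustration under stated assumptions (empirical frequency estimates, hypothetical function names, hand-made toy data): it evaluates both sides of Eq. (4) for a complementary pair of features and for a redundant pair.

```python
from collections import Counter
from math import log2


def H(*cols):
    """Empirical joint entropy (bits) of equal-length discrete sequences."""
    n = len(cols[0])
    return -sum(c / n * log2(c / n) for c in Counter(zip(*cols)).values())


def mi(x, y):
    return H(x) + H(y) - H(x, y)


def cmi(x, y, z):
    return H(x, z) + H(y, z) - H(x, y, z) - H(z)


def cor(f, f_s, c):
    """Left-hand side of Eq. (4): I(F;F_s) - I(F;F_s|C)."""
    return mi(f, f_s) - cmi(f, f_s, c)


def cor_via_class(f, f_s, c):
    """Right-hand side of Eq. (4): I(F;C) - I(F;C|F_s)."""
    return mi(f, c) - cmi(f, c, f_s)


# Complementary pair: the class is the XOR of two independent features.
f = [0, 0, 1, 1]
f_s = [0, 1, 0, 1]
c_xor = [a ^ b for a, b in zip(f, f_s)]
print(cor(f, f_s, c_xor), cor_via_class(f, f_s, c_xor))  # -1.0 -1.0

# Redundant pair: both features are identical copies of the class.
g = [0, 1, 0, 1]
print(cor(g, g, g), cor_via_class(g, g, g))  # 1.0 1.0
```

Both sides agree in each case, and the sign behaves as the text describes: negative for the complementary pair and positive for the redundant pair.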


We now explain $R(F) - M(F)$ using Eq. (4), since the relationship between $I(F; C)$ and
$I(F; C \mid F_s)$ is straightforward: if $I(F; C)$ is much greater than $I(F; C \mid F_s)$, the relevance
between $F$ and class $C$ becomes significantly weaker once the information of $F_s$ is given. In other
words, $F$ is redundant to $F_s$. Conversely, if $I(F; C)$ is much smaller than $I(F; C \mid F_s)$, the
relevance between $F$ and class $C$ becomes significantly stronger once the information of $F_s$ is given,
i.e. $F$ is complementary to $F_s$. Thus, $R(F) - M(F)$ can be applied to simultaneously measure redundancy
and complementary correlation: when $R(F) - M(F) > 0$, it captures the magnitude of redundancy between $F$
and $F_s$; when $R(F) - M(F) < 0$, it captures the magnitude of complementary correlation between $F$ and
$F_s$. In the context of MI, the following expression can be applied as the evaluation criterion according
to Eq. (3):

$$J(F) = I(F; C) - \mathrm{Pair\_Cor}(F; S), \qquad (5)$$
where $\mathrm{Pair\_Cor}(F; S)$ takes the form of

$$\mathrm{Pair\_Cor}(F; S) = \sum_{F_s \in S} \left( I(F; F_s) - I(F; F_s \mid C) \right). \qquad (6)$$

For convenience in the following sections, we denote $\mathrm{cor}(F; F_s) = I(F; F_s) - I(F; F_s \mid C)$,
and thus Eq. (6) can be rewritten as

$$\mathrm{Pair\_Cor}(F; S) = \sum_{F_s \in S} \mathrm{cor}(F; F_s). \qquad (7)$$

Note that although Eq. (7) can measure both redundancy and complementary correlation, it is still a
pairwise criterion since it only captures the relationship between two features. Criteria that only concern
pairwise correlation among features are also called first-order approximations in the literature [51]. We
further discuss the limitation of Eq. (7) in detail in the next section.

4. Feature selection with redundancy-complementariness dispersion 

4.1. Interference effect of false positives

First-order approximation is a prevailing strategy that seems to offer the best trade-off between
execution efficiency and the quality of the selected features. Yet even with pairwise correlation taken
into account, ignoring the group effect of features is known to be suboptimal. As mentioned before, feature
selection methods that only handle individual relevance assume mutual independence among features.
Similarly, first-order approximation in redundancy analysis only concentrates on individual redundancy. In
other words, it assumes that all the selected features are mutually independent. Since the first-order
approximation only identifies pairwise correlation, it cannot take higher-order inter-feature correlation
into account, and thus may misidentify and select actually-redundant features (i.e. False Positives,
denoted as FPs hereafter; similarly, we use the term True Positives (TPs) to denote the selected
actually-salient features), which in turn interfere with the later selection process.

More specifically, focusing only on pairwise correlation gives FPs the chance to interfere with the
evaluation of candidates. Suppose the selected feature subset already contains FPs. The pairwise
correlation between the candidate and each FP is then an interference that prompts the candidate to be
given an unduly high status if such correlation is influential to the value of the evaluation criterion
$J(\cdot)$. Recall that the correlation between the candidate and each selected feature is denoted as
$\mathrm{cor}(F; F_s)$, where $F_s \in S$ ($S$ is the selected feature subset), and thus
$\mathrm{Pair\_Cor}(F; S) = \sum_{F_s \in S} \mathrm{cor}(F; F_s)$. The interference effect of FPs can be
illustrated in two possible scenarios shown in Figs. 1 (a) and (b), where the node in yellow, the nodes in
red, and the nodes in green denote the candidate, FPs, and TPs, respectively. The distance between the
yellow node and any other node is in proportion to the strength of their pairwise correlation, e.g. a short
distance corresponds to complementary correlation, while a long distance corresponds to redundant
correlation.

Figure 1: Toy examples of interference effect of FPs.

Scenario 1: FPs are close to the candidate. As shown in Fig. 1 (a), most of the TPs are distant from the
candidate, which implies that the candidate is more likely to be redundant than complementary to the TPs
(which corresponds to positive cor values in terms of Eq. (4)), and thus it is possibly a redundant
feature. However, as the FPs are very close to the candidate, they are more likely to be complementary to
it and the corresponding cor values tend to be negative. Under this circumstance, the complementary
correlation between the candidate and the FPs impairs the reliability of the estimate of
$\mathrm{Pair\_Cor}(F; S)$ and causes the candidate to be overestimated.

Scenario 2: FPs are distant from the candidate. Fig. 1 (b) shows that most of the TPs are close to the
candidate. This implies that the candidate is more complementary to the TPs and thus more likely to be a
salient feature that should be selected. However, it is redundant to the distant FPs and the corresponding
cor values tend to be positive, which also impairs the reliability of the estimate of
$\mathrm{Pair\_Cor}(F; S)$ and causes the candidate to be underestimated.

In fact, the interference effect of FPs revealed in the above scenarios can be characterized by the
dissimilarity of the correlations between the candidate and the selected features. That is, the intensity
of the interference effect of FPs depends on the amount of dispersion of the correlation between the
candidate and the selected features. For a given value of $\mathrm{Pair\_Cor}(F; S)$, if the correlation
between $F$ and the FPs in $S$ is more complementary (larger negative cor values), then the correlation
between $F$ and the TPs in $S$ must be more redundant (larger positive cor values), and vice versa. We call
such dissimilarity the redundancy-complementariness dispersion. As a heuristic, we apply the standard
deviation of cor to capture such dispersion in order to identify the possible interference effect of FPs,
since standard deviation is a widely used index for risk estimation and instability identification. The
standard deviation of $\mathrm{cor}(F; F_s)$ given the selected feature subset $S$ takes the form of


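$$\sigma(F; S) = \left( \frac{\sum_{F_s \in S} \left( \mathrm{cor}(F; F_s) - \mu(F; S) \right)^2}{|S|} \right)^{1/2}, \qquad (8)$$

where $\mu(F; S)$ is the mean value of $\mathrm{cor}(F; F_s)$, calculated as

$$\mu(F; S) = \frac{\mathrm{Pair\_Cor}(F; S)}{|S|}. \qquad (9)$$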


Thus, the smaller the value of $\sigma(F; S)$, the less influential the interference effect of FPs. We
try to find salient candidates not only with more complementariness and less redundancy, but also with less
redundancy-complementariness dispersion, i.e. a small value of $\sigma(F; S)$, to heuristically avoid the
interference effect of FPs. To this end, we use $\sigma(F; S)$ to adjust the value of Pair_Cor. Recall that
Pair_Cor simultaneously measures two types of correlation, i.e. redundancy (where the value of Pair_Cor is
positive) and complementariness (where the value is negative). Taking this into account, we use the
following criterion

$$J_{RCD}(F) = D(F; C) - \phi(F; S) \cdot \mathrm{Pair\_Cor}(F; S), \qquad (10)$$

where

$$\phi(F; S) = \begin{cases} 1 + \sigma(F; S), & \mathrm{Pair\_Cor}(F; S) > 0 \\ 1 - \sigma(F; S), & \mathrm{Pair\_Cor}(F; S) \le 0 \end{cases} \qquad (11)$$

to evaluate and select features among candidates. Note that $\phi(F; S)$ is defined piecewise for the two
types of correlation. Also, we use $1 + \sigma(F; S)$ and $1 - \sigma(F; S)$ rather than $\sigma(F; S)$ as
the coefficient of $\mathrm{Pair\_Cor}(F; S)$ in order to reduce the estimation bias of $\sigma(F; S)$,
particularly when only a few features have been selected in $S$.
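
To make the effect of the dispersion coefficient concrete, the following minimal sketch evaluates Eqs. (8) to (11) for hand-picked cor values. The function name `rcd_score`, the example numbers, and the fallback to plain relevance when no feature has been selected yet are assumptions made for illustration only.

```python
from math import sqrt


def rcd_score(relevance, cors):
    """Eq. (10): relevance minus the dispersion-adjusted Pair_Cor.

    relevance -- D(F; C), e.g. I(F; C)
    cors      -- list of cor(F; F_s), one value per already-selected feature F_s
    """
    if not cors:  # assumption: with an empty S only relevance is available
        return relevance
    n = len(cors)
    pair_cor = sum(cors)                                # Eq. (7)
    mu = pair_cor / n                                   # Eq. (9)
    sigma = sqrt(sum((c - mu) ** 2 for c in cors) / n)  # Eq. (8)
    phi = 1 + sigma if pair_cor > 0 else 1 - sigma      # Eq. (11)
    return relevance - phi * pair_cor


# Redundant candidates: same relevance and same Pair_Cor (0.3), different dispersion.
steady = rcd_score(0.5, [0.1, 0.1, 0.1])
mixed = rcd_score(0.5, [0.5, 0.4, -0.6])

# Complementary candidates: same relevance and same Pair_Cor (-0.3), different dispersion.
comp_steady = rcd_score(0.5, [-0.1, -0.1, -0.1])
comp_mixed = rcd_score(0.5, [-0.5, -0.4, 0.6])

print(steady, mixed, comp_steady, comp_mixed)
```

In both pairs the candidate with the larger dispersion receives the lower score: a positive Pair_Cor is penalized more strongly, and a negative Pair_Cor is rewarded less.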


4.2. Proposed method

Based on the above analysis, we propose the feature selection framework shown in Fig. 2. It not only
considers class-relevance and pairwise inter-correlation of features, but also takes into account the
effect of redundancy-complementariness dispersion. Similar to most feature selection methods, the proposed
method applies a sequential forward search strategy to select features. That is, only one candidate is
selected at each iteration.

Figure 2: A new framework for feature selection.

We show the pseudo code of the proposed algorithm in Algorithm 1.

Algorithm 1 contains a 'repeat' loop and two 'foreach' loops, plus the calculation of $\sigma(F; S)$
(line 10 in Algorithm 1), which takes at least $|S|$ steps. Thus the complexity of Algorithm 1 is
$O(\delta \cdot |\mathbf{F}|^2)$, where $\delta$ is the predefined number of features to be selected. Since
only one candidate is selected at the end of each iteration, when traversing $F_s \in S$ we only need the
additional information of the newly added feature rather than traversing $S$ again. As for the calculation
of $\sigma(F; S)$ (i.e. the square root of the variance of cor), we can take an alternative formulation of
variance, i.e. $\mathrm{Var}(X) = E(X^2) - E^2(X)$, to make use of the loops in Algorithm 1. That is, to
obtain $\sigma(F; S)$, we let $P$ record the sum of $\mathrm{cor}^2$ and $Q$ record the sum of cor, and
then we have

$$\sigma(F; S) = \left( \frac{P - Q^2 / |S|}{|S|} \right)^{1/2}.$$
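
The running-sum formula can be checked against the direct definition of Eq. (8). The sketch below is illustrative only: the cor values are made up, and the clamp to zero is an added safeguard (an assumption, not part of the original formula) against tiny negative values caused by floating-point rounding.

```python
from math import sqrt


def sigma_direct(cors):
    """Standard deviation of cor computed from its definition, Eq. (8)."""
    n = len(cors)
    mu = sum(cors) / n
    return sqrt(sum((c - mu) ** 2 for c in cors) / n)


def sigma_running(P, Q, n):
    """Same quantity from the running sums P = sum(cor^2) and Q = sum(cor)."""
    return sqrt(max((P - Q * Q / n) / n, 0.0))


cors = [0.30, -0.10, 0.25, -0.40]  # hypothetical cor(F; F_s) values, in selection order
P = Q = 0.0
for n, c in enumerate(cors, start=1):
    # One O(1) update per newly selected feature instead of a pass over all of S.
    P += c * c
    Q += c
    print(n, sigma_running(P, Q, n), sigma_direct(cors[:n]))
```

Each printed row should show the two estimates agreeing up to rounding error, which is what allows the inner loop over $S$ to be dropped.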


Taking the above into account, we show the fast implementation of Algorithm 1 as Algorithm 2.

By utilizing the additional information gained at the latest iteration, the complexity of Algorithm 2
reduces to $O(\delta \cdot |\mathbf{F}|)$, which is more efficient than Algorithm 1. Thus, we implement the
proposed method according to Algorithm 2 in the experiments to verify the performance of RCDFS.
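
For readers who prefer code to pseudo code, the incremental scheme can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Java/Weka implementation used in the experiments: features are already discretized, MI and CMI are estimated from empirical frequencies, ties are broken by feature name, and the names `rcdfs`, `features`, and `target` are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2, sqrt


def entropy(*cols):
    n = len(cols[0])
    return -sum(c / n * log2(c / n) for c in Counter(zip(*cols)).values())


def mi(x, y):
    return entropy(x) + entropy(y) - entropy(x, y)


def cmi(x, y, z):
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


def rcdfs(features, target, k):
    """Select k feature names; features maps name -> list of discrete values."""
    relevance = {f: mi(col, target) for f, col in features.items()}
    sum_sq = dict.fromkeys(features, 0.0)    # running sum of cor^2 per candidate
    pair_cor = dict.fromkeys(features, 0.0)  # running sum of cor per candidate
    remaining = set(features)
    newest = max(sorted(remaining), key=relevance.get)  # most relevant feature first
    selected = [newest]
    remaining.remove(newest)
    while remaining and len(selected) < k:
        n = len(selected)
        scores = {}
        for f in remaining:
            # Only the newly added feature contributes a new cor term.
            cor = mi(features[f], features[newest]) - cmi(features[f], features[newest], target)
            sum_sq[f] += cor * cor
            pair_cor[f] += cor
            sigma = sqrt(max((sum_sq[f] - pair_cor[f] ** 2 / n) / n, 0.0))
            phi = 1 + sigma if pair_cor[f] > 0 else 1 - sigma
            scores[f] = relevance[f] - phi * pair_cor[f]
        newest = max(sorted(remaining), key=scores.get)
        selected.append(newest)
        remaining.remove(newest)
    return selected


# Toy data (assumed): the class is a XOR b; a_dup duplicates a; noise is irrelevant.
rows = list(product([0, 1], repeat=3))
features = {
    "a": [r[0] for r in rows],
    "b": [r[1] for r in rows],
    "a_dup": [r[0] for r in rows],
    "noise": [r[2] for r in rows],
}
target = [r[0] ^ r[1] for r in rows]
print(rcdfs(features, target, 2))  # ['a', 'b']
```

On this toy problem every feature has zero individual relevance, so a relevance-only ranking cannot separate them, whereas the cor term identifies `b` as complementary to `a` and prefers it to the duplicate `a_dup`.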


5. Experimental study

In order to evaluate the performance and effectiveness of the proposed method, five representative and
well-performing feature selection methods (CMIM [24], mRMR [17], FCBF [22], MIM [31] and ReliefF [35])
are compared with the proposed algorithm. These five algorithms are briefly reviewed as follows.

• CMIM (Conditional Mutual Information Maximization) [24]: This well-known algorithm makes use of
CMI to simultaneously measure class-relevance and inter-correlation of features, applying the following
function

$$J(F) = \min_{F_s \in S} I(F; C \mid F_s)$$

as the evaluation criterion, taking the heuristic that the $F_s$ attaining
$\min_{F_s \in S} I(F; C \mid F_s)$ could best represent the conditioning set $S$.
Input: $D$ /* dataset */, $\mathbf{F}$ /* feature set */, $C$ /* class */, $\delta$ /* expected number of features to be selected */
Output: $S$ /* selected feature subset */

 1  Initialize $S \leftarrow \emptyset$, $k \leftarrow 1$
 2  repeat
 3      foreach $F \in \mathbf{F}$ do
 4          Relevance $\leftarrow I(F; C)$
 5          Pair_Cor $\leftarrow 0$
 6          foreach $F_s \in S$ do
 7              cor $\leftarrow I(F; F_s) - I(F; F_s \mid C)$
 8              Pair_Cor $\leftarrow$ Pair_Cor + cor
 9          end
10          Calculate $\sigma(F; S)$ according to Eq. (8)
11          if Pair_Cor > 0 then
12              $\phi \leftarrow 1 + \sigma(F; S)$
13          else
14              $\phi \leftarrow 1 - \sigma(F; S)$
15          end
16          $J(F) \leftarrow$ Relevance $- \phi \cdot$ Pair_Cor
17      end
18      $S \leftarrow S \cup \{F\}$ satisfying $F = \arg\max_{F \in \mathbf{F}} J(F)$
19      $\mathbf{F} \leftarrow \mathbf{F} - \{F\}$
20      $k \leftarrow k + 1$
21  until $k > \delta$
22  return $S$

Algorithm 1: RCDFS: Redundancy-Complementariness Dispersion-based Feature Selection


• mRMR (minimum Redundancy and Maximum Relevance) [17]: It is a well-known feature selection
algorithm that uses MI to measure class-relevance and pairwise dependence. It selects the feature
maximizing

$$J(F) = I(F; C) - \frac{1}{|S|} \sum_{F_s \in S} I(F; F_s)$$

in a greedy manner, where $I(F; C)$ measures the class-relevance of $F$ and
$\frac{1}{|S|} \sum_{F_s \in S} I(F; F_s)$ measures the average pairwise dependence between $F$ and
$F_s \in S$. Note that we have already introduced it in Section 3.2.

Input: $D$ /* dataset */, $\mathbf{F}$ /* feature set */, $C$ /* class */, $\delta$ /* expected number of features to be selected */
Output: $S$ /* selected feature subset */

Initialize $S \leftarrow \emptyset$, $F_{new} \leftarrow \emptyset$, $A(F) \leftarrow 0$ for $\forall F \in \mathbf{F}$, Pair_Cor$(F) \leftarrow 0$ for $\forall F \in \mathbf{F}$, $k \leftarrow 0$
foreach $F \in \mathbf{F}$ do
    Relevance$(F) \leftarrow I(F; C)$
end
$F_{new} \leftarrow F$ satisfying $F = \arg\max_{F \in \mathbf{F}}$ Relevance$(F)$
$S \leftarrow S \cup \{F_{new}\}$
$\mathbf{F} \leftarrow \mathbf{F} - \{F_{new}\}$
$k \leftarrow k + 1$
repeat
    foreach $F \in \mathbf{F}$ do
        Relevance$(F) \leftarrow I(F; C)$
        cor $\leftarrow I(F; F_{new}) - I(F; F_{new} \mid C)$
        $A(F) \leftarrow A(F) + \mathrm{cor}^2$
        Pair_Cor$(F) \leftarrow$ Pair_Cor$(F)$ + cor
        $\sigma(F; S) \leftarrow \left( (A(F) - \mathrm{Pair\_Cor}(F)^2 / |S|) / |S| \right)^{1/2}$
        if Pair_Cor$(F) > 0$ then
            $\phi \leftarrow 1 + \sigma(F; S)$
        else
            $\phi \leftarrow 1 - \sigma(F; S)$
        end
        $J(F) \leftarrow$ Relevance$(F) - \phi \cdot$ Pair_Cor$(F)$
    end
    $F_{new} \leftarrow F$ satisfying $F = \arg\max_{F \in \mathbf{F}} J(F)$
    $S \leftarrow S \cup \{F_{new}\}$
    $\mathbf{F} \leftarrow \mathbf{F} - \{F_{new}\}$
    $k \leftarrow k + 1$
until $k \ge \delta$
return $S$

Algorithm 2: A fast implementation of RCDFS

• FCBF (Fast Correlation-Based Feature selection) [22]: In this algorithm, Symmetrical Uncertainty
(SU) is used as the evaluation metric. It first ranks features in descending order of their SU with the
class. Then it eliminates redundant features in terms of an approximate Markov blanket criterion: if
$SU(F_1; C) > SU(F_2; C)$ and $SU(F_1; F_2) > SU(F_2; C)$, $F_2$ is identified as a redundant feature of
$F_1$ and is eliminated. For this method, we set the predefined threshold $\gamma = 0$ as suggested by [22].

• MIM (Mutual Information Maximization) [31]: It is the most basic feature ranking algorithm based
on mutual information and only concerns the class-relevance of features. We have also introduced it
in Section 3.2. It applies

$$J(F) = I(F; C)$$

as the criterion to select the top $m$ features with the highest values of $I(F; C)$. It is one of the most
typical benchmark algorithms in the field of feature selection.

• ReliefF [35]: It is a well-known distance-based feature ranking method that searches for the nearest
neighbors of samples for each class label and then weights features in terms of how well they differentiate
samples with different class labels. As for the parameter settings, we use 5 neighbors and 30 instances
throughout the experiments as suggested by [35].
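
The three MI-based baselines above differ only in how a candidate is scored against the already selected set. The sketch below places the three scoring rules side by side; it is an illustration under assumptions (empirical frequency estimates, hypothetical function names, and a fallback to plain relevance when nothing has been selected yet) and does not reproduce the Weka or Java implementations used in the experiments. FCBF and ReliefF are not covered.

```python
from collections import Counter
from math import log2


def entropy(*cols):
    n = len(cols[0])
    return -sum(c / n * log2(c / n) for c in Counter(zip(*cols)).values())


def mi(x, y):
    return entropy(x) + entropy(y) - entropy(x, y)


def cmi(x, y, z):
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


def mim_score(f, c, selected):
    """MIM: class-relevance only; the selected set is ignored."""
    return mi(f, c)


def mrmr_score(f, c, selected):
    """mRMR: relevance minus the average pairwise MI with the selected features."""
    if not selected:
        return mi(f, c)
    return mi(f, c) - sum(mi(f, s) for s in selected) / len(selected)


def cmim_score(f, c, selected):
    """CMIM: the smallest conditional relevance over the selected features."""
    if not selected:
        return mi(f, c)
    return min(cmi(f, c, s) for s in selected)


# Toy data (assumed): the class is a XOR b, and a has already been selected.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
a_dup = list(a)
c = [x ^ y for x, y in zip(a, b)]
for name, f in (("b", b), ("a_dup", a_dup)):
    print(name, mim_score(f, c, [a]), mrmr_score(f, c, [a]), cmim_score(f, c, [a]))
```

On this toy problem MIM cannot separate the complementary feature `b` from the duplicate `a_dup`, mRMR only penalizes the duplicate, and CMIM is the only one of the three that actively rewards `b`.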

Weka (Waikato environment for knowledge analysis) [52] is chosen as the classification platform. Since 
FCBF, MIM, and ReliefF have already been integrated in Weka, we directly use them to generate datasets 
with their selected features before classification. CMIM, mRMR, and the proposed method are implemented
in Java with Weka interfaces. All experiments are conducted on a personal computer with a 2.60 GHz CPU
and 8 GB RAM running Windows 7.

5.1. Datasets 

In order to validate the performance of the proposed method, ten frequently used datasets are applied
in our experiments. Six of them (mushroom, kr-vs-kp, sonar, multiple features Karhunen, DNA, and isolet5)
are well-known UCI datasets, and the rest (Colon Tumor, BCR_ABL, Prostate Cancer, and Breast Cancer)
are gene microarray datasets with high dimensionality (i.e. containing 2000 or more features). General
information about these datasets is summarized in Tab. 1. For the continuous and mixed datasets, a
supervised discretization method called MDL [53] is employed to discretize continuous features before
feature selection and classification.

5.2. Classifiers and Experimental settings

5.2.1. Classifiers

In our experiments, four well-known and frequently used classifiers, namely Naive Bayesian Classifier
(NBC) [52], Support Vector Machine (SVM) [54], k-Nearest Neighbor (kNN) [55] and C4.5 decision tree [56],
are adopted to generate classification error rates on the datasets with the features selected by the
different feature selection methods. We set k = 1 for kNN and employ Gaussian RBF kernels for SVM.

Table 1: Description of datasets

#   Name                        # samples  # features  Type     # classes
1   mushroom                    8124       22          nominal  2
2   kr-vs-kp                    3196       36          nominal  2
3   sonar                       208        60          nominal  2
4   multiple features Karhunen  2000       64          numeric  10
5   DNA                         3186       180         nominal  3
6   isolet5                     1559       617         mixed    26
7   Colon Tumor                 62         2000        numeric  2
8   BCR_ABL                     112        12559       numeric  2
9   Prostate Cancer             34         12601       numeric  2
10  Breast Cancer               19         24482       numeric  2


5.2.2. Experimental settings 

First, we show the classification results of the four classifiers on 1, ..., m selected features for each feature 
selection method, where m in our experiments is set to min{50, |F|}. 10-fold cross-validation is applied 
in this part. Note that the nature of the learning process differs from classifier to classifier. Since we are 
interested in the quality of the selected features, independently of the type of classification rule applied, 
the average result of the four classifiers is reported. 
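
The cap on the number of selected features and the averaging over classifiers can be sketched as follows. This is a minimal illustration, not the authors' code; the names `num_selected` and `classifier_average` are assumptions introduced here.

```python
from statistics import mean

MAX_FEATURES = 50  # cap on the number of selected features used in the paper


def num_selected(n_features, cap=MAX_FEATURES):
    """Return m = min(cap, |F|), the largest subset size that is evaluated."""
    return min(cap, n_features)


def classifier_average(errors_by_classifier):
    """Average the error curves of several classifiers position by position.

    `errors_by_classifier` maps a classifier name to its error rates for
    1, 2, ..., m selected features; all curves must have the same length.
    """
    curves = list(errors_by_classifier.values())
    if len({len(curve) for curve in curves}) != 1:
        raise ValueError("all error curves must have the same length")
    return [mean(values) for values in zip(*curves)]
```

For example, `num_selected(22)` gives 22 for mushroom, whereas `num_selected(24482)` gives 50 for Breast Cancer.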

In addition, we compare the best classification results of the six feature selection methods among their 
selected features. That is, we check the average classification results for each feature selection method on 
the datasets with the number of selected features ranging from 1 to min{50, |F|}, and report the best one. 
In order to achieve stable results, (M = 10) × (N = 10)-fold cross-validation is applied, i.e. 10-fold 
cross-validation is conducted ten times for each classifier on each dataset. Thus, a total of one hundred 
result samples can be collected, where each sample is the average classification result of the four 
classifiers. Finally, the average of the one hundred samples is reported in our paper. The Wilcoxon rank-sum 
test is applied to determine the statistical significance of the difference between the results (where the 
significance level is set to 0.05). 
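
The sketch below illustrates how (M = 10) × (N = 10)-fold cross-validation yields one hundred splits, and how two groups of result samples can be compared with a rank-sum test. It is a sketch under stated assumptions, not the authors' implementation: the splits are not stratified, the p-value uses the normal approximation without tie correction, and the names `repeated_kfold` and `rank_sum_test` are introduced here for illustration.

```python
import math
import random


def repeated_kfold(n_samples, n_repeats=10, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs for M x N-fold cross-validation.

    Each repeat reshuffles the sample indices and splits them into
    n_folds disjoint test folds, giving n_repeats * n_folds splits.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for fold in range(n_folds):
            test = order[fold::n_folds]
            test_set = set(test)
            train = [i for i in order if i not in test_set]
            yield train, test


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Tied values receive average ranks; the variance is not corrected for
    ties. Returns (z, p_value).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the ranks i+1, ..., j
        for pos in range(i, j):
            ranks[pos] = avg_rank
        i = j
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mu = n1 * (n1 + n2 + 1) / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mu) / sigma
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

In this setting, `x` would hold the one hundred result samples of RCDFS on a dataset and `y` those of a compared method; a p-value below 0.05 is then read as a significant difference.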

Finally, to test the stability of the performance on different datasets, the average classification results 
on the different datasets, in the ranges from 1 to 5, from 1 to 10, from 1 to 15, from 1 to 20, from 1 to 25, 
from 1 to 30, from 1 to 35, from 1 to 40, from 1 to 45, and from 1 to 50 selected features, are reported and 
analyzed for each classifier and feature selection method. The Friedman test is applied to analyze 
the statistical significance of the results. These ten ranges are considered to approximately cover 
the transitory period needed to reach a stable performance on the datasets used. 
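
The range averages and the Friedman statistic can be sketched as follows. The paper does not spell out its implementation, so this is an illustration under assumptions: blocks are taken to be datasets and treatments to be feature selection methods, tied values get average ranks without tie correction, and the names `range_averages` and `friedman_statistic` are introduced here.

```python
def range_averages(error_curve, step=5, upper=50):
    """Average an error curve over the ranges 1..5, 1..10, ..., 1..upper.

    error_curve[i] is the error rate with i + 1 selected features. If a
    method selected fewer features than a range asks for, the average is
    taken over the features that exist.
    """
    averages = {}
    for k in range(step, upper + 1, step):
        prefix = error_curve[:k]
        averages[k] = sum(prefix) / len(prefix)
    return averages


def friedman_statistic(blocks):
    """Return the Friedman chi-square statistic.

    blocks[i][j] is the score of treatment j (a feature selection method)
    in block i (a dataset). Lower scores receive lower ranks.
    """
    n = len(blocks)
    k = len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        order = sorted(range(k), key=lambda j: row[j])
        i = 0
        while i < k:
            j = i
            while j < k and row[order[j]] == row[order[i]]:
                j += 1
            avg_rank = (i + 1 + j) / 2.0  # mean of the ranks i+1, ..., j
            for pos in range(i, j):
                rank_sums[order[pos]] += avg_rank
            i = j
    total = sum(r * r for r in rank_sums)
    return 12.0 / (n * k * (k + 1)) * total - 3.0 * n * (k + 1)
```

With six methods the statistic has 5 degrees of freedom under the usual chi-square approximation, for which the critical value at the 0.05 level is about 11.07.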

5.3. Experimental results and discussion 

Figs. 3-12 show the 10-fold cross-validation average test error rate of the four classifiers 
(NBC, SVM, kNN, and C4.5) on the ten datasets to illustrate the effectiveness of the proposed method RCDFS, 
where the X axis gives the number of selected features and the Y axis gives the average test error 
rate. According to the results shown in Figs. 3-12, the superiority of RCDFS can 
be verified in the majority of cases. In particular, on seven datasets, namely mushroom (Fig. 3), kr-vs-kp 
(Fig. 4), sonar (Fig. 5), DNA (Fig. 7), Colon Tumor (Fig. 9), BCR_ABL (Fig. 10), and Breast Cancer (Fig. 12), 
RCDFS significantly outperforms CMIM, mRMR, FCBF, MIM, and ReliefF. More precisely, RCDFS usually 
performs better at the beginning of the feature selection process on several datasets such as sonar (Fig. 5) 
and Colon Tumor (Fig. 9). This is probably because both redundancy and complementariness are considered 
by RCDFS, rather than only measuring pairwise redundancy like mRMR and CMIM or ignoring redundancy 
among features like MIM, FCBF and ReliefF. For other datasets, e.g. the BCR_ABL dataset, the test error 
rate of RCDFS is higher than that of CMIM and mRMR on the first five selected features, 
whereas once the sixth feature has been selected, RCDFS performs better (i.e. its test error is lower) than 
the other methods and is never exceeded. This is possibly because the dispersion of the 
redundancy-complementariness correlation becomes influential in the feature evaluation process after several 
features have been selected, i.e. the interference effect of FPs in the selected subset impairs the evaluation 
ability of all compared algorithms except RCDFS. On the whole, it can also be seen that RCDFS reaches its 
lowest error rate with fewer features than the other methods (e.g. it achieves its best classification 
results with only five and sixteen features on Colon Tumor and DNA, respectively). It is also found that 
the performance of RCDFS is not always outstanding and is sometimes inferior to that of CMIM (see, e.g., 
Fig. 11). This may also be due to the dispersion of the redundancy-complementariness correlation, since 
conditions other than the difference between TPs and FPs can also lead to high dispersion. 

Tab. 2 records the number of features selected by each feature selection algorithm at its best performance. 
We observe from the table that the average number of features selected by RCDFS (18.7) is smaller than that 
of the other algorithms used in our experiment. This indicates an advantage of RCDFS: the best classification 
result can be obtained with a sufficiently small set of features. 
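
As a quick arithmetic check, the averages in the last row of Tab. 2 can be recomputed from the per-dataset counts. The snippet below simply copies the counts from the table; the variable and function names are illustrative.

```python
# Number of selected features at best performance, copied from Tab. 2.
# Row order: mushroom, kr-vs-kp, sonar, multiple features kahunen, DNA,
# isolet5, Colon Tumor, BCR_ABL, Prostate Cancer, Breast Cancer.
TABLE_2 = {
    "RCDFS": [9, 21, 10, 34, 16, 49, 5, 8, 19, 16],
    "CMIM": [5, 35, 15, 31, 26, 47, 7, 14, 16, 21],
    "mRMR": [12, 27, 11, 25, 18, 50, 15, 10, 3, 20],
    "FCBF": [3, 4, 9, 32, 17, 31, 13, 48, 34, 32],
    "MIM": [8, 35, 17, 25, 18, 49, 5, 31, 40, 50],
    "ReliefF": [7, 34, 18, 23, 25, 48, 46, 42, 8, 36],
}


def column_averages(table):
    """Return the mean number of selected features per method."""
    return {name: sum(counts) / len(counts) for name, counts in table.items()}


averages = column_averages(TABLE_2)
smallest = min(averages, key=averages.get)  # method with the fewest features on average
```

The recomputed means agree with the "Avg." row of Tab. 2 (18.7, 21.7, 19.1, 22.3, 27.8 and 28.7), and `smallest` is RCDFS.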

Tab. 3 shows the average test error rate of NBC, SVM, kNN, and C4.5 on the ten datasets over (M = 
10) × (N = 10)-fold cross-validation. For each dataset, the Wilcoxon test is conducted to evaluate 
the statistical significance of the difference between two groups of result samples, i.e. the result 
samples that correspond to RCDFS and those of any other feature selection method. In Tab. 3, the “Err” column 
records the average test error rate of the (M = 10) × (N = 10)-fold cross-validation, and the “p-val” column 
records the p-value associated with the Wilcoxon test, where a p-value less than 0.05 indicates that the 
difference between the two average values is statistically significant. The notations “•” and “◦” are used 
to show that the test error rate corresponding to the current feature selection method is significantly lower 
or higher, respectively, than that of the proposed method (corresponding to the “RCDFS” column) under the 
test. The bold value in each row indicates the best result among the six feature selection methods. The 
average error rate over the ten datasets is given in the last row. 

Figure 3: Accuracy comparison with different number of selected features on mushroom. 

Figure 4: Accuracy comparison with different number of selected features on kr-vs-kp. 

Figure 5: Accuracy comparison with different number of selected features on sonar. 

Table 2: Number of selected features corresponding to best performance 

# features                    RCDFS   CMIM    mRMR    FCBF    MIM     ReliefF
mushroom                      9       5       12      3       8       7
kr-vs-kp                      21      35      27      4       35      34
sonar                         10      15      11      9       17      18
multiple features kahunen     34      31      25      32      25      23
DNA                           16      26      18      17      18      25
isolet5                       49      47      50      31      49      48
Colon Tumor                   5       7       15      13      5       46
BCR_ABL                       8       14      10      48      31      42
Prostate Cancer               19      16      3       34      40      8
Breast Cancer                 16      21      20      32      50      36
Avg.                          18.7    21.7    19.1    22.3    27.8    28.7

As can be seen from Tab. 3, RCDFS outperforms the other methods in terms of average test error rate on the 
mushroom, sonar, DNA, BCR_ABL, Prostate Cancer, and Breast Cancer datasets. According to the average test 
error rate over the ten datasets given in the last row, the best one is obtained by our method (7.89) and 
the worst by MIM (12.57). Also, the average test error rate of CMIM (8.12) is better than that of the 
remaining algorithms (mRMR (8.71), FCBF (11.10), and ReliefF (11.62)). 

Figure 6: Accuracy comparison with different number of selected features on multiple features kahunen. 

For further analysis, the diagram in Fig. 13 is used to visualize the statistical significance of RCDFS 
compared with the selected methods under the four classifiers on the ten datasets. A blue box in Fig. 13 
indicates that the test error rate of RCDFS is significantly better than that of the compared algorithm on 
the current dataset. A yellow box indicates that there is no significant difference between the results of 
RCDFS and the compared algorithm. A red box indicates that the test error rate of RCDFS is significantly 
worse than that of the compared algorithm. The results shown in Fig. 13 indicate that RCDFS achieves better 
performance on most of the datasets compared with the selected feature selection algorithms. 
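
The three-way reading used above (significantly better, no significant difference, significantly worse) can be illustrated on the classifier-averaged values of Tab. 3. The sketch below tallies the fifty pairwise comparisons in that table; it is not a reconstruction of Fig. 13, and the names `compare` and `tally` are assumptions introduced here. A printed p-value of 0.000 is entered as 0.0.

```python
ALPHA = 0.05  # significance level used in the paper

# (RCDFS error, [(competitor error, p-value), ...]) per dataset, copied from
# Tab. 3; competitors are ordered CMIM, mRMR, FCBF, MIM, ReliefF.
TABLE_3 = [
    (0.32, [(0.37, 0.207), (0.47, 0.0), (23.26, 0.0), (20.57, 0.0), (0.39, 0.0)]),
    (5.32, [(5.61, 0.015), (5.14, 0.231), (5.91, 0.0), (5.61, 0.015), (5.21, 0.449)]),
    (14.05, [(16.05, 0.025), (17.44, 0.0), (18.63, 0.0), (17.38, 0.001), (16.56, 0.006)]),
    (10.05, [(9.89, 0.317), (9.81, 0.143), (10.08, 0.874), (9.81, 0.143), (9.83, 0.144)]),
    (5.98, [(5.99, 0.959), (6.46, 0.011), (8.16, 0.0), (6.46, 0.011), (10.79, 0.0)]),
    (25.13, [(23.19, 0.0), (27.06, 0.0), (23.62, 0.0), (37.89, 0.0), (37.11, 0.0)]),
    (3.32, [(2.83, 0.105), (4.02, 0.715), (2.37, 0.103), (7.26, 0.001), (8.35, 0.0)]),
    (5.98, [(5.99, 0.959), (6.46, 0.011), (8.16, 0.0), (6.46, 0.011), (10.79, 0.0)]),
    (5.13, [(5.58, 0.070), (5.71, 0.113), (5.96, 0.960), (6.71, 0.758), (8.54, 0.053)]),
    (3.66, [(5.65, 0.0), (4.52, 0.106), (4.86, 0.021), (7.58, 0.0), (8.63, 0.0)]),
]


def compare(rcdfs_err, other_err, p_value, alpha=ALPHA):
    """Classify one comparison from the viewpoint of RCDFS."""
    if p_value >= alpha:
        return "no difference"
    return "better" if rcdfs_err < other_err else "worse"


def tally(table):
    """Count the outcomes of all RCDFS-versus-competitor comparisons."""
    counts = {"better": 0, "no difference": 0, "worse": 0}
    for rcdfs_err, others in table:
        for other_err, p_value in others:
            counts[compare(rcdfs_err, other_err, p_value)] += 1
    return counts


counts = tally(TABLE_3)
```

On the values of Tab. 3, this tally gives 29 comparisons in which RCDFS is significantly better, 19 without significant difference, and 2 in which it is significantly worse (both on isolet5).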

Tabs. 4-7 show the statistical significance of the average error using the Friedman test under the different 
classifiers on the ten datasets. The results in columns k = 5, 10, 15, 20, 25, 30, 35, 40, 45, and 50 represent 
the average classification error in the ranges from 1 to 5, from 1 to 10, from 1 to 15, from 1 to 20, from 
1 to 25, from 1 to 30, from 1 to 35, from 1 to 40, from 1 to 45, and from 1 to 50 features, respectively. 
Note that FCBF may select fewer features than the other methods, e.g. it only selects 4 features on the 
mushroom dataset; in that case the average over up to 4 features is reported in columns k = 25, 30, 35, 40, 
45, and 50. A very small p-val (i.e. p-val < 0.05) indicates a significant difference among the average 
values. In addition, we use S/N, given in the last row of the tables, to represent a statistically 
significant/insignificant difference among the average values under the Friedman test with significance 
level 0.05. The bold value in each column shows the best classification result among the six feature 
selection methods. 

Figure 7: Accuracy comparison with different number of selected features on DNA. 

Figure 8: Accuracy comparison with different number of selected features on isolet5. 

Figure 9: Accuracy comparison with different number of selected features on Colon Tumor. 

Figure 10: Accuracy comparison with different number of selected features on BCR_ABL. 

Figure 11: Accuracy comparison with different number of selected features on Prostate Cancer. 

Figure 12: Accuracy comparison with different number of selected features on Breast Cancer. 

Table 3: Average classification error rate of the six feature selection methods on selected features with 
NBC, SVM, kNN and C4.5, and the result of the Wilcoxon test. 

# Dataset   RCDFS    CMIM              mRMR              FCBF              MIM               ReliefF
            Err      Err     p-val     Err     p-val     Err     p-val     Err     p-val     Err     p-val
1           0.32     0.37    0.207     0.47    0.000◦    23.26   0.000◦    20.57   0.000◦    0.39    0.000◦
2           5.32     5.61    0.015◦    5.14    0.231     5.91    0.000◦    5.61    0.015◦    5.21    0.449
3           14.05    16.05   0.025◦    17.44   0.000◦    18.63   0.000◦    17.38   0.001◦    16.56   0.006◦
4           10.05    9.89    0.317     9.81    0.143     10.08   0.874     9.81    0.143     9.83    0.144
5           5.98     5.99    0.959     6.46    0.011◦    8.16    0.000◦    6.46    0.011◦    10.79   0.000◦
6           25.13    23.19   0.000•    27.06   0.000◦    23.62   0.000•    37.89   0.000◦    37.11   0.000◦
7           3.32     2.83    0.105     4.02    0.715     2.37    0.103     7.26    0.001◦    8.35    0.000◦
8           5.98     5.99    0.959     6.46    0.011◦    8.16    0.000◦    6.46    0.011◦    10.79   0.000◦
9           5.13     5.58    0.070     5.71    0.113     5.96    0.960     6.71    0.758     8.54    0.053
10          3.66     5.65    0.000◦    4.52    0.106     4.86    0.021◦    7.58    0.000◦    8.63    0.000◦
Avg.        7.89     8.12              8.71              11.10             12.57             11.62

◦ statistical degradation at significance level of 0.05. 
• statistical improvement at significance level of 0.05. 

Figure 13: Average classification error rate comparison between RCDFS and the selected methods on the selected ten datasets. 

Tab. 4 shows that the average test error rate of the Naive Bayesian Classifier (NBC) corresponding to RCDFS 
is the lowest among all methods and that the p-val is smaller than 0.05. This indicates that RCDFS performs 
best with the NBC classifier for the number of selected features in all ranges. Similarly, the average 
test error rate of SVM corresponding to RCDFS shown in Tab. 5 is also the lowest for the number of selected 
features in most of the ranges, whereas CMIM is superior to the other methods with SVM in the ranges 
from 1 to 45 and from 1 to 50. From Tab. 6 and Tab. 7, the average error rates of kNN and C4.5 corresponding 
to RCDFS are both the lowest and the p-val is also smaller than 0.05, which verifies the effectiveness of 
the proposed method. 
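
The statement about SVM can be checked directly against the values of Tab. 5 by taking the method with the lowest average error in each range. The snippet below copies the table values; the names `TABLE_5` and `best_per_range` are illustrative.

```python
RANGES = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]

# Average SVM error rates per range, copied from Tab. 5.
TABLE_5 = {
    "RCDFS": [18.83, 13.65, 11.41, 10.15, 10.34, 9.72, 9.32, 9.34, 9.24, 9.20],
    "CMIM": [18.87, 14.24, 12.14, 10.75, 10.85, 10.12, 9.57, 9.37, 8.96, 8.64],
    "mRMR": [19.64, 15.20, 12.83, 11.38, 11.57, 10.82, 10.25, 10.08, 9.71, 9.41],
    "FCBF": [20.81, 15.80, 13.49, 12.20, 12.46, 11.70, 11.29, 10.78, 10.59, 10.43],
    "MIM": [27.28, 21.15, 18.53, 16.60, 16.79, 15.61, 14.67, 14.75, 14.06, 13.49],
    "ReliefF": [29.10, 22.17, 18.68, 16.52, 16.86, 15.82, 14.92, 15.09, 14.40, 13.87],
}


def best_per_range(table, ranges):
    """Return the method with the lowest average error for each range."""
    best = {}
    for idx, k in enumerate(ranges):
        best[k] = min(table, key=lambda name: table[name][idx])
    return best


best = best_per_range(TABLE_5, RANGES)
```

Here `best` maps k = 5 through k = 40 to RCDFS and k = 45 and k = 50 to CMIM, in line with the discussion above.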


Table 4: Average classification error rate for all datasets with NB classifiers, and the result of the Friedman test. 

NB        k = 5    k = 10   k = 15   k = 20   k = 25   k = 30   k = 35   k = 40   k = 45   k = 50
RCDFS     18.91    13.92    11.66    10.45    10.56    9.93     9.48     8.87     8.59     8.34
CMIM      19.34    14.70    12.68    11.43    11.48    10.88    10.40    9.82     9.51     9.25
mRMR      19.42    15.32    13.23    12.07    12.30    11.73    11.32    10.88    10.57    10.29
FCBF      20.26    15.51    13.22    12.04    12.30    11.60    11.26    10.72    10.59    10.48
MIM       27.01    21.11    18.64    17.11    17.38    16.50    15.78    15.52    15.00    14.56
ReliefF   29.28    23.04    19.99    18.35    18.95    18.02    17.29    17.08    16.43    15.87
p-val     0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.001    0.000
          S        S        S        S        S        S        S        S        S        S

Table 5: Average classification error rate for all datasets with SVM classifiers, and the result of the Friedman test. 

SVM       k = 5    k = 10   k = 15   k = 20   k = 25   k = 30   k = 35   k = 40   k = 45   k = 50
RCDFS     18.83    13.65    11.41    10.15    10.34    9.72     9.32     9.34     9.24     9.20
CMIM      18.87    14.24    12.14    10.75    10.85    10.12    9.57     9.37     8.96     8.64
mRMR      19.64    15.20    12.83    11.38    11.57    10.82    10.25    10.08    9.71     9.41
FCBF      20.81    15.80    13.49    12.20    12.46    11.70    11.29    10.78    10.59    10.43
MIM       27.28    21.15    18.53    16.60    16.79    15.61    14.67    14.75    14.06    13.49
ReliefF   29.10    22.17    18.68    16.52    16.86    15.82    14.92    15.09    14.40    13.87
p-val     0.003    0.000    0.000    0.000    0.000    0.001    0.002    0.001    0.001    0.002
          S        S        S        S        S        S        S        S        S        S

Table 6: Average classification error rate for all datasets with kNN classifiers, and the result of the Friedman test. 

kNN       k = 5    k = 10   k = 15   k = 20   k = 25   k = 30   k = 35   k = 40   k = 45   k = 50
RCDFS     18.72    14.27    12.38    11.22    11.62    11.11    10.77    11.19    11.01    10.89
CMIM      19.10    14.90    13.04    12.02    12.59    12.21    11.87    12.37    12.19    12.03
mRMR      19.56    15.61    13.68    12.64    13.22    12.64    12.21    12.63    12.38    12.17
FCBF      20.54    16.45    14.52    13.51    14.22    13.70    13.42    13.27    13.07    12.92
MIM       26.94    21.28    18.88    17.40    18.14    17.36    16.74    17.44    16.97    16.57
ReliefF   28.98    22.81    19.87    18.05    18.72    17.91    17.29    18.02    17.49    17.04
p-val     0.003    0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000    0.000
          S        S        S        S        S        S        S        S        S        S

Table 7: Average classification error rate for all datasets with C4.5 classifiers, and the result of the Friedman test. 

C4.5      k = 5    k = 10   k = 15   k = 20   k = 25   k = 30   k = 35   k = 40   k = 45   k = 50
RCDFS     19.23    15.50    13.84    12.91    13.83    13.50    13.27    14.22    14.13    14.08
CMIM      19.50    16.48    15.04    14.12    15.10    14.78    14.53    15.51    15.40    15.29
mRMR      20.26    17.72    16.19    15.33    16.36    15.97    15.68    16.83    16.71    16.57
FCBF      21.34    17.91    16.57    16.05    17.40    17.18    17.07    17.57    17.56    17.57
MIM       27.47    22.76    20.84    19.34    20.37    19.57    18.91    20.13    19.77    19.52
ReliefF   29.65    24.22    21.95    20.68    21.94    21.32    20.69    21.90    21.29    20.81
p-val     0.019    0.000    0.000    0.000    0.000    0.000    0.001    0.000    0.000    0.000
          S        S        S        S        S        S        S        S        S        S


6. Conclusions and future work 

Relevance and redundancy are two important feature properties that have attracted much attention in the 
study of feature selection. Many algorithms eliminate redundancy by measuring pairwise inter-correlation 
between features, but they cannot identify the complementariness of features or the correlation among more 
than two features. Although the former problem can be effectively addressed by introducing a modification 
item, high inter-correlation of features still makes the result far from optimal. Specifically, pairwise 
approximation of high inter-correlation may misidentify and select FPs, which will in turn impair the 
effectiveness of feature evaluation. In order to identify the interference effect of FPs, the 
redundancy-complementariness dispersion is taken into account in the proposed method to adjust the 
measurement of pairwise inter-correlation of features. To illustrate the effectiveness of the proposed 
method RCDFS, classification experiments are conducted with four frequently used classifiers on ten 
datasets. In the experiments, RCDFS is compared with five representative feature selection methods, namely 
CMIM, mRMR, FCBF, MIM, and ReliefF. The classification results show that RCDFS performs satisfactorily. 
To verify the stability of RCDFS, the Wilcoxon test as well as the Friedman test are adopted to assess the 
statistical significance of the differences among the results of the feature selection methods. According 
to the test results, RCDFS performs better than the selected methods in most of the cases. 

Although the superiority of RCDFS has been verified in the experiments, there remain challenges 
to be addressed in our future work. One is how to properly set the weights of the three objectives, 
i.e. coordinate relevance, redundancy-complementariness, and dispersion of pairwise inter-correlation, 
which still needs to be studied. Possible directions include multi-objective programming and 
multi-index evaluation techniques such as data envelopment analysis. Additionally, since there is no causal 
relationship between FPs and the dispersion of pairwise inter-correlation, considering such dispersion 
alone may not always be effective in feature evaluation. Thus, how to design more effective heuristics in 
the context of first-order approximation will be studied further in the future. 